Laboratory exercises on datasets

Assignment of February 1, 2017

Let us consider the dataframe mtcars, which comprises the fuel consumption and 10 aspects of design and performance for 32 automobiles (1970s models). The help file is given below

mtcars R Documentation

Motor Trend Car Road Tests

Format

A data frame with 32 observations on 11 (numeric) variables.

[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors

Describe how to perform a preliminary data analysis on this dataframe, using suitable R commands.

Aim and analysis of the variables

We want to build a statistical regression model to analyze the average fuel consumption of the car models contained in the mtcars dataframe.

The response variable is mpg, which represents the number of (US) miles travelled, on average, per gallon of fuel.

Let us examine the variables

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Some variables need to be converted to more suitable types

mtcars$cyl <- as.integer(mtcars$cyl)
mtcars$hp <- as.integer(mtcars$hp)
mtcars$vs <- factor(mtcars$vs)
levels(mtcars$vs) <- c("V-shaped", "straight")
mtcars$am <- factor(mtcars$am)
levels(mtcars$am) <- c("automatic", "manual")
mtcars$gear <- as.integer(mtcars$gear)
mtcars$carb <- as.integer(mtcars$carb)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : int  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : int  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "V-shaped","straight": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: int  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: int  4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec              vs             am    
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   V-shaped:18   automatic:19  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   straight:14   manual   :13  
##  Median :3.695   Median :3.325   Median :17.71                               
##  Mean   :3.597   Mean   :3.217   Mean   :17.85                               
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90                               
##  Max.   :4.930   Max.   :5.424   Max.   :22.90                               
##       gear            carb      
##  Min.   :3.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000  
##  Median :4.000   Median :2.000  
##  Mean   :3.688   Mean   :2.812  
##  3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :8.000

Analysis of the response variable

library(moments)

hist(mtcars$mpg, probability = TRUE, breaks = 10)
lines(density(mtcars$mpg), lwd = 3)
# overlay the normal density with matching mean and standard deviation
curve(dnorm(x, mean = mean(mtcars$mpg), sd = sd(mtcars$mpg)),
      from = 0, to = 50, add = TRUE, col = "cyan")
abline(v = mean(mtcars$mpg), col = "red")      # mean
abline(v = median(mtcars$mpg), col = "green")  # median

skewness(mtcars$mpg)
## [1] 0.6404399
kurtosis(mtcars$mpg)
## [1] 2.799467

The histogram appears slightly off-centre with respect to the normal distribution; let us try a transformation

logmpg <- log(mtcars$mpg)
hist(logmpg, probability = TRUE, breaks = 10)
lines(density(logmpg), lwd = 3)
# overlay the normal density; note the grid now covers the log scale
curve(dnorm(x, mean = mean(logmpg), sd = sd(logmpg)),
      from = 2, to = 4, add = TRUE, col = "cyan")
abline(v = mean(logmpg), col = "red")      # mean
abline(v = median(logmpg), col = "green")  # median

skewness(logmpg)
## [1] -0.02241292
kurtosis(logmpg)
## [1] 2.64908

The log transformation seems to improve the behaviour of the density of mpg, since it markedly increases its symmetry at the cost of a slight decrease in the kurtosis index.

mtcars$logmpg <- logmpg
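The visual comparison can be complemented with a formal normality check; a minimal sketch using base R's shapiro.test (not part of the original analysis):

```r
# Shapiro-Wilk normality test on the original and the log-transformed response;
# a larger p-value means weaker evidence against normality
shapiro.test(mtcars$mpg)
shapiro.test(log(mtcars$mpg))
```

A clearly larger p-value on the log scale would support the choice of the log transformation.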

The other variables

Histograms and barplots of the variables

#par(mfrow=c(2, 5))

barplot(table(mtcars[, "cyl"]), xlab = "Number of cylinders")

hist(mtcars$disp, breaks = 10)

hist(mtcars$hp, breaks = 10)

hist(mtcars$drat, breaks = 10)

hist(mtcars$wt, breaks = 10)

hist(mtcars$qsec, breaks = 10)

barplot(table(mtcars[, "vs"]), xlab = "vs")

barplot(table(mtcars[, "am"]), xlab = "am")

barplot(table(mtcars[, "gear"]), xlab = "gear")

barplot(table(mtcars[, "carb"]), xlab = "carb")

#par(mfrow=c(1,1))

Relationships among the variables

# boxplots of log(mpg) against the discrete regressors (cyl, vs, am, gear)
for (i in c(2, 8, 9, 10)) {
  boxplot(mtcars$logmpg ~ mtcars[, i], xlab = names(mtcars)[i])
}

# scatterplots with a lowess smoother for the continuous regressors
for (i in c(3, 4, 5, 6, 7, 11)) {
  plot(mtcars$logmpg ~ mtcars[, i], xlab = names(mtcars)[i])
  lines(lowess(mtcars$logmpg ~ mtcars[, i]))
}

There seems to be a strong linear correlation between log(mpg) and cyl, vs, am, and wt. The variables hp, disp, and drat also appear correlated with log(mpg), but not linearly.
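These pairwise impressions can also be collected in a single scatterplot matrix; a minimal sketch with base R (using mpg directly so the snippet is self-contained):

```r
# scatterplot matrix of the response against the numeric candidate regressors,
# with a lowess smoother drawn in each panel
pairs(mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")],
      panel = panel.smooth)
```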

Let us check the correlations with the Pearson and Spearman indices

# bind the candidate regressors into a numeric matrix (cbind keeps the names)
df <- with(mtcars, cbind(cyl, wt, disp, hp))

cor(mtcars$logmpg, df, method = "pearson")
##        cyl         wt       disp         hp 
## -0.8546921 -0.8930611 -0.8804044 -0.7894727
cor(mtcars$logmpg, df, method = "spearman")
##        cyl         wt       disp         hp 
## -0.9108013 -0.8864220 -0.9088824 -0.8946646
# does it make sense to compute correlations on categorical variables?
#cor(mtcars$logmpg, as.numeric(mtcars$vs), method = "pearson")
#cor(mtcars$logmpg, as.numeric(mtcars$vs), method = "spearman")
#cor(mtcars$logmpg, as.numeric(mtcars$am), method = "pearson")
#cor(mtcars$logmpg, as.numeric(mtcars$am), method = "spearman")

The index values confirm some of the earlier impressions and revise others:

  • the strong (negative) linear correlation is confirmed for cyl and wt,

  • the comparison between the Pearson and Spearman indices for hp and disp suggests that these two variables are correlated with logmpg, but not linearly (they are probably inversely proportional),

  • as for drat, the comparison between the two indices suggests a linear correlation, in contrast with the hypothesis made earlier. However, this correlation is very weak and will probably have to be ignored when building the model.

Let us try transforming hp and disp

plot(mtcars$logmpg~I(1/mtcars$hp))
lines(lowess(mtcars$logmpg~I(1/mtcars$hp)))

plot(mtcars$logmpg~I(1/mtcars$disp))
lines(lowess(mtcars$logmpg~I(1/mtcars$disp)))

# matrix of the two transformed regressors
df <- with(mtcars, cbind("1/disp" = 1/disp, "1/hp" = 1/hp))
cor(mtcars$logmpg, df, method = "pearson")
##    1/disp      1/hp 
## 0.8913563 0.8366077

The plots and the values of the Pearson indices confirm that the two transformations are appropriate.

After fitting the model fit <- lm(mpg ~ disp + hp + wt + drat, data = mtcars), the following outputs are obtained by the R commands summary(fit) and plot(fit), respectively. Describe how to interpret these results, and then suggest how to proceed with further analyses.

## 
## Call:
## lm(formula = mpg ~ disp + hp + wt + drat, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5077 -1.9052 -0.5057  0.9821  5.6883 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.148738   6.293588   4.631  8.2e-05 ***
## disp         0.003815   0.010805   0.353  0.72675    
## hp          -0.034784   0.011597  -2.999  0.00576 ** 
## wt          -3.479668   1.078371  -3.227  0.00327 ** 
## drat         1.768049   1.319779   1.340  0.19153    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.602 on 27 degrees of freedom
## Multiple R-squared:  0.8376, Adjusted R-squared:  0.8136 
## F-statistic: 34.82 on 4 and 27 DF,  p-value: 2.704e-10
plot(fit, which = c(1:6), lwd=2)

The summary shows a very large intercept and large p-values, especially for disp and drat, suggesting that these regressors have no significant effect on the response variable.

The diagnostic plots indeed show a non-constant pattern in the residuals, and both the QQ-plot and the standardized-residuals plot confirm the heavy tails already detected in the exploratory analysis: at the extreme values the residuals tend to grow.

This impression is also confirmed by the leverage and Cook's distance plots, with three observations (Toyota Corolla, Chrysler Imperial, Maserati Bora) flagged in practically all the plots, confirming their influence on the fitted values of mpg.
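The influential observations can also be extracted numerically; a minimal sketch, refitting the model from the text and ranking the cars by Cook's distance:

```r
# refit the model from the text and list the most influential observations
fit <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)  # names and distances of the top three cars
```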

Based on the observations made during the EDA, let us propose an alternative model

Alternative model

mtcars.fit <- lm(logmpg ~ cyl + wt + I(1/disp) + I(1/hp), data = mtcars)
summary(mtcars.fit)
## 
## Call:
## lm(formula = logmpg ~ cyl + wt + I(1/disp) + I(1/hp), data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.15842 -0.08823 -0.02354  0.08879  0.21965 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.35417    0.26674  12.575 8.44e-13 ***
## cyl         -0.02094    0.02526  -0.829 0.414396    
## wt          -0.15294    0.03753  -4.076 0.000362 ***
## I(1/disp)    9.87334   15.63627   0.631 0.533064    
## I(1/hp)     19.59982   10.17911   1.925 0.064760 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1065 on 27 degrees of freedom
## Multiple R-squared:  0.8885, Adjusted R-squared:  0.872 
## F-statistic: 53.79 on 4 and 27 DF,  p-value: 1.785e-12
plot(mtcars.fit, which = 1:6)

This model seems markedly better than the previous one (much lower AIC, higher \(R^2\) indices), although the AIC values are not directly comparable here, since the two models use the response on different scales (mpg vs. logmpg). On the other hand, both the QQ-plot and the standardized residuals look somewhat worse and, despite a very small p-value for the F test, the individual p-values of the regressors are large, a symptom of collinearity that had not been detected before.

library(DAAG)
## Loading required package: lattice
vif(mtcars.fit)
##       cyl        wt I(1/disp)   I(1/hp) 
##    5.5598    3.6819    8.5939    4.7799

The VIF analysis confirms the presence of collinearity in the model.

mtcars.fit2 <- lm(logmpg ~ wt + I(1/hp), data = mtcars)
summary(mtcars.fit2)
## 
## Call:
## lm(formula = logmpg ~ wt + I(1/hp), data = mtcars)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.156884 -0.083000 -0.008951  0.083485  0.242816 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.29685    0.13518  24.389  < 2e-16 ***
## wt          -0.18351    0.02757  -6.656 2.68e-07 ***
## I(1/hp)     29.68441    6.56369   4.523 9.54e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1061 on 29 degrees of freedom
## Multiple R-squared:  0.8813, Adjusted R-squared:  0.8731 
## F-statistic: 107.6 on 2 and 29 DF,  p-value: 3.805e-14
plot(mtcars.fit2, which = 1:6)

vif(mtcars.fit2)
##      wt I(1/hp) 
##  2.0048  2.0048
AIC(fit, mtcars.fit, mtcars.fit2)
##             df       AIC
## fit          6 158.58366
## mtcars.fit   6 -45.93601
## mtcars.fit2  4 -47.92801
anova(mtcars.fit2, mtcars.fit)
##   Res.Df       RSS Df Sum of Sq         F    Pr(>F)
## 1     29 0.3263152                                  
## 2     27 0.3064681  2 0.0198471 0.8742682 0.4286456

The new model appears preferable, as confirmed by the VIF analysis, by the individual and overall p-values, by the ANOVA comparison, and by the AIC statistic.
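The same conclusion can be cross-checked with an automatic search; a sketch using backward stepwise selection by AIC, starting from the candidate regressors of the exploratory analysis (log(mpg) is used directly so the snippet is self-contained):

```r
# backward elimination by AIC from the full set of candidate regressors
full <- lm(log(mpg) ~ cyl + wt + I(1/disp) + I(1/hp), data = mtcars)
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)  # the regressors retained by the search
```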

Assignment of February 13, 2017

Let us consider the dataframe bp.obese of the library ISwR, which comprises information about sex, obesity and blood pressure for a random sample of 102 Mexican-American adults in a small California town. The help file and the output of the str command are given below

## 
## Attaching package: 'ISwR'
## The following object is masked from 'package:DAAG':
## 
##     lung
bp.obese R Documentation

Obesity and blood pressure

Format

This data frame contains the following columns:

sex

a numeric vector code, 0: male, 1: female.

obese

a numeric vector, ratio of actual weight to ideal weight from New York Metropolitan Life Tables.

bp

a numeric vector, systolic blood pressure (mm Hg).

##       sex             obese             bp       
##  Min.   :0.0000   Min.   :0.810   Min.   : 94.0  
##  1st Qu.:0.0000   1st Qu.:1.143   1st Qu.:116.0  
##  Median :1.0000   Median :1.285   Median :124.0  
##  Mean   :0.5686   Mean   :1.313   Mean   :127.0  
##  3rd Qu.:1.0000   3rd Qu.:1.430   3rd Qu.:137.5  
##  Max.   :1.0000   Max.   :2.390   Max.   :208.0

The aim of the study is to analyze the potential relationship between blood pressure, which is the response variable, and obesity, taking into account also the factor regressor sex. Describe how to perform a preliminary data analysis on this dataframe, using suitable R commands and comment the following plot.

Preliminary analysis

Let us start with a quick overview of the dataset.

help(bp.obese)
bp.obese R Documentation

Obesity and blood pressure

Description

The bp.obese data frame has 102 rows and 3 columns. It contains data from a random sample of Mexican-American adults in a small California town.

Usage

bp.obese

Format

This data frame contains the following columns:

sex

a numeric vector code, 0: male, 1: female.

obese

a numeric vector, ratio of actual weight to ideal weight from New York Metropolitan Life Tables.

bp

a numeric vector, systolic blood pressure (mm Hg).

Source

B.W. Brown and M. Hollander (1977), Statistics: A Biomedical Introduction, Wiley.

Examples

plot(bp~obese,pch = ifelse(sex==1, "F", "M"), data = bp.obese)
summary(bp.obese)
##       sex             obese             bp       
##  Min.   :0.0000   Min.   :0.810   Min.   : 94.0  
##  1st Qu.:0.0000   1st Qu.:1.143   1st Qu.:116.0  
##  Median :1.0000   Median :1.285   Median :124.0  
##  Mean   :0.5686   Mean   :1.313   Mean   :127.0  
##  3rd Qu.:1.0000   3rd Qu.:1.430   3rd Qu.:137.5  
##  Max.   :1.0000   Max.   :2.390   Max.   :208.0

The dataset consists of three variables: two continuous numeric ones (obese and bp) and one unordered categorical one (sex).
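A reasonable next step is to recode sex as a factor and look at the response conditionally on it; a minimal sketch (assuming the ISwR package is installed):

```r
library(ISwR)  # provides the bp.obese data frame
bp.obese$sex <- factor(bp.obese$sex, labels = c("male", "female"))
summary(bp.obese)
boxplot(bp ~ sex, data = bp.obese)  # blood pressure by sex
with(bp.obese,
     plot(obese, bp, pch = ifelse(sex == "female", "F", "M")))  # bp vs obese
```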

After fitting these linear models fit1 <- lm(bp ~ obese, data = bp.obese), fit2 <- lm(bp ~ obese + sex, data = bp.obese) and fit3 <- lm(bp ~ obese * sex, data = bp.obese), the following outputs are obtained by the R function summary.

## 
## Call:
## lm(formula = bp ~ obese, data = bp.obese)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.570 -11.241  -2.400   9.116  71.390 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   96.818      8.920   10.86  < 2e-16 ***
## obese         23.001      6.667    3.45 0.000822 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.28 on 100 degrees of freedom
## Multiple R-squared:  0.1064, Adjusted R-squared:  0.09743 
## F-statistic:  11.9 on 1 and 100 DF,  p-value: 0.0008222
## 
## Call:
## lm(formula = bp ~ obese + sex, data = bp.obese)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.263 -11.613  -2.057   6.424  72.207 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   93.287      8.937  10.438  < 2e-16 ***
## obese         29.038      7.172   4.049 0.000102 ***
## sex           -7.730      3.715  -2.081 0.040053 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17 on 99 degrees of freedom
## Multiple R-squared:  0.1438, Adjusted R-squared:  0.1265 
## F-statistic: 8.314 on 2 and 99 DF,  p-value: 0.0004596
## 
## Call:
## lm(formula = bp ~ obese * sex, data = bp.obese)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.645 -11.621  -1.708   6.737  71.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  102.112     18.231   5.601 1.95e-07 ***
## obese         21.646     15.118   1.432    0.155    
## sex          -19.596     21.664  -0.905    0.368    
## obese:sex      9.558     17.191   0.556    0.579    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.05 on 98 degrees of freedom
## Multiple R-squared:  0.1465, Adjusted R-squared:  0.1204 
## F-statistic: 5.607 on 3 and 98 DF,  p-value: 0.001368

Describe how to interpret these results, and then suggest how to proceed with further analyses.
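Since the three models are nested, a natural further analysis is a sequential F test; a sketch (again assuming the ISwR package for the data):

```r
library(ISwR)
fit1 <- lm(bp ~ obese, data = bp.obese)
fit2 <- lm(bp ~ obese + sex, data = bp.obese)
fit3 <- lm(bp ~ obese * sex, data = bp.obese)
anova(fit1, fit2, fit3)  # tests the added terms sex and obese:sex in turn
```

Given the non-significant interaction term in the fit3 summary, this comparison is expected to favour fit2.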

Assignment of February 15, 2018

Let us consider the dataframe SAheart of the library ElemStatLearn, which comprises information about a retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. The help file and the output of the str command are given below

## 
## Attaching package: 'ElemStatLearn'
## The following object is masked from 'package:DAAG':
## 
##     ozone
SAheart R Documentation

South African Hearth Disease Data

Format

A data frame with 462 observations on the following 10 variables.

sbp

systolic blood pressure

tobacco

cumulative tobacco (kg)

ldl

low density lipoprotein cholesterol

adiposity

a numeric vector

famhist

family history of heart disease, a factor with levels Absent Present

typea

type-A behavior

obesity

a numeric vector

alcohol

current alcohol consumption

age

age at onset

chd

response, coronary heart disease

## 'data.frame':    462 obs. of  10 variables:
##  $ sbp      : int  160 144 118 170 134 132 142 114 114 132 ...
##  $ tobacco  : num  12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
##  $ ldl      : num  5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ...
##  $ adiposity: num  23.1 28.6 32.3 38 27.8 ...
##  $ famhist  : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
##  $ typea    : int  49 55 52 51 60 62 59 62 49 69 ...
##  $ obesity  : num  25.3 28.9 29.1 32 26 ...
##  $ alcohol  : num  97.2 2.06 3.81 24.26 57.34 ...
##  $ age      : int  52 63 46 58 49 45 38 58 29 53 ...
##  $ chd      : int  1 1 0 1 1 0 0 1 0 1 ...

The aim of the study is to analyze the potential relationship between the binary response variable chd and the explanatory variables considered in the dataframe. Describe how to perform a preliminary data analysis on this dataframe, using suitable R commands, and comment the following plot.
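The preliminary analysis can mirror the previous exercises; a sketch (note that the ElemStatLearn package hosting SAheart has been archived from CRAN, so its availability is an assumption):

```r
library(ElemStatLearn)  # provides SAheart; may need installation from the CRAN archive
SAheart$chd <- factor(SAheart$chd, labels = c("no", "yes"))
summary(SAheart)
boxplot(ldl ~ chd, data = SAheart)   # numeric regressors conditionally on the response
boxplot(age ~ chd, data = SAheart)
table(SAheart$famhist, SAheart$chd)  # association between the two factors
```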

With the command mod0 <- glm(chd ~ ldl, data = SAheart, family = binomial), a simple logistic regression model is defined for describing the potential effect of the level of ldl on the probability of coronary heart disease. Comment the model fitting outcomes given by the function summary.

## 
## Call:
## glm(formula = chd ~ ldl, family = binomial, data = SAheart)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1647  -0.8948  -0.7426   1.2688   1.8637  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.96867    0.27308  -7.209 5.63e-13 ***
## ldl          0.27466    0.05164   5.319 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 596.11  on 461  degrees of freedom
## Residual deviance: 564.28  on 460  degrees of freedom
## AIC: 568.28
## 
## Number of Fisher Scoring iterations: 4
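On the odds scale, the ldl coefficient is read through its exponential: each additional unit of ldl multiplies the odds of chd by exp(0.27466) ≈ 1.32, i.e. roughly a 32% increase. A sketch of the computation from the reported coefficients:

```r
b0 <- -1.96867; b1 <- 0.27466  # intercept and ldl coefficient from the summary
exp(b1)              # odds ratio per unit increase of ldl (about 1.32)
plogis(b0 + b1 * 5)  # estimated P(chd = 1) for a subject with ldl = 5
```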

After fitting these two further logistic regression models mod1 <- glm(chd ~ ., data = SAheart, family = binomial) and mod2 <- glm(chd ~ tobacco + ldl + famhist + typea + age + ldl:famhist, data = SAheart, family = binomial), the following outputs are obtained by the R function summary.

## 
## Call:
## glm(formula = chd ~ ., family = binomial, data = SAheart)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7781  -0.8213  -0.4387   0.8889   2.5435  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -6.1507209  1.3082600  -4.701 2.58e-06 ***
## sbp             0.0065040  0.0057304   1.135 0.256374    
## tobacco         0.0793764  0.0266028   2.984 0.002847 ** 
## ldl             0.1739239  0.0596617   2.915 0.003555 ** 
## adiposity       0.0185866  0.0292894   0.635 0.525700    
## famhistPresent  0.9253704  0.2278940   4.061 4.90e-05 ***
## typea           0.0395950  0.0123202   3.214 0.001310 ** 
## obesity        -0.0629099  0.0442477  -1.422 0.155095    
## alcohol         0.0001217  0.0044832   0.027 0.978350    
## age             0.0452253  0.0121298   3.728 0.000193 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 596.11  on 461  degrees of freedom
## Residual deviance: 472.14  on 452  degrees of freedom
## AIC: 492.14
## 
## Number of Fisher Scoring iterations: 5
## 
## Call:
## glm(formula = chd ~ tobacco + ldl + famhist + typea + age + ldl:famhist, 
##     family = binomial, data = SAheart)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8463  -0.7938  -0.4419   0.9161   2.4956  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -5.79224    0.94625  -6.121 9.28e-10 ***
## tobacco             0.08496    0.02628   3.233  0.00122 ** 
## ldl                 0.01758    0.07302   0.241  0.80974    
## famhistPresent     -0.77068    0.62341  -1.236  0.21637    
## typea               0.03690    0.01240   2.974  0.00294 ** 
## age                 0.05140    0.01030   4.990 6.03e-07 ***
## ldl:famhistPresent  0.33334    0.11595   2.875  0.00404 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 596.11  on 461  degrees of freedom
## Residual deviance: 466.90  on 455  degrees of freedom
## AIC: 480.9
## 
## Number of Fisher Scoring iterations: 5

Describe how to interpret these results, and then suggest how to proceed with further analyses.
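mod1 and mod2 are not nested, so the cleanest comparison is by AIC (492.14 vs 480.9, in favour of mod2); an automatic search can also be sketched, again assuming the ElemStatLearn package is available:

```r
library(ElemStatLearn)
mod1 <- glm(chd ~ ., data = SAheart, family = binomial)
step(mod1, trace = 0)  # backward elimination by AIC from the full model
```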

Assignment of June 11, 2018

Let us consider the dataframe house, which includes information about the price, the size, the floor, the number of bedrooms (bed) and the number of bathrooms (bath) of 546 houses. The output of the str command is given below

## 'data.frame':    546 obs. of  5 variables:
##  $ price: num  42000 38500 49500 60500 61000 66000 66000 69000 83800 88500 ...
##  $ size : int  5850 4000 3060 6650 6360 4160 3880 4160 4800 5500 ...
##  $ bed  : int  3 2 3 3 2 3 3 3 3 3 ...
##  $ bath : int  1 1 1 1 1 1 2 1 1 2 ...
##  $ floor: int  2 1 1 2 1 1 2 3 1 4 ...

A suitable linear regression model can be defined in order to study the potential relationship between the price, which is the response variable, and the explanatory variables considered in the dataframe. Describe how to perform a preliminary data analysis on this dataframe, using suitable R commands. Moreover, consider the following plots and discuss the possibility of measuring the variables price and size in the logarithmic scale.

After fitting the regression model fit <- lm(log(price) ~ log(size) + bed + bath + floor, data = house), the following outputs are obtained by the R commands summary(fit) and plot(fit), respectively.

Describe how to interpret these results, and then suggest how to proceed with further analyses with particular regard to prediction.
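For prediction, predict() with interval = "prediction" gives intervals on the log scale, which are back-transformed with exp(). The house data frame is not reproduced in the text, so the sketch below builds a small synthetic stand-in with the same structure (all values are made up):

```r
# synthetic stand-in for the 'house' data frame (structure only, invented values)
set.seed(1)
house <- data.frame(size  = round(runif(100, 2000, 8000)),
                    bed   = sample(2:4, 100, replace = TRUE),
                    bath  = sample(1:3, 100, replace = TRUE),
                    floor = sample(1:4, 100, replace = TRUE))
house$price <- exp(5 + 0.7 * log(house$size) + 0.1 * house$bath +
                     rnorm(100, sd = 0.2))

fit <- lm(log(price) ~ log(size) + bed + bath + floor, data = house)
new <- data.frame(size = 5000, bed = 3, bath = 2, floor = 2)
# 95% prediction interval, back-transformed to the price scale
exp(predict(fit, newdata = new, interval = "prediction", level = 0.95))
```

Note that exponentiating the point prediction yields the median, not the mean, of the predicted price distribution.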

Assignment of February 4, 2019

Let us consider the dataframe wages, which contains information about 3294 USA working individuals. The data are taken from the National Longitudinal Survey and refer to 1987. The variables are listed below and the output of the str command is given

A data frame with 3294 observations on the following 5 variables.

exper

experience in years

male

1 male, 0 female

school

years of schooling

wage

wage (in 1980 $) per hour

region

Center, North, South
## 'data.frame':    3296 obs. of  5 variables:
##  $ exper : int  9 12 11 9 8 9 8 10 12 7 ...
##  $ male  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ school: int  13 12 11 14 14 14 12 12 10 12 ...
##  $ wage  : num  6.32 5.48 3.64 4.59 2.42 ...
##  $ region: Factor w/ 3 levels "Center","North",..: 1 1 3 1 1 1 1 1 1 3 ...

The aim of the study is to analyze the potential relationship between the response variable wage and the explanatory variables considered in the dataframe. Describe how to perform a preliminary data analysis on this dataframe, using suitable R commands. Moreover, consider the following plots and discuss the possibility of measuring the variable wage in the logarithmic scale

In order to describe the effect of the factor male on the response log(wage) we may analyze this plot, where the probability distribution of log(wage) is represented by kernel density estimates conditioned on the two levels (1 male, 0 female) of the variable male

With the commands mod.0 <- lm(log(wage) ~ male, data = wages) and mod.1 <- lm(log(wage) ~ exper * male, data = wages), two regression models are defined for describing the potential effect of male and exper on the response log(wage). Comment the model fitting outcomes given by the function summary (Hint: consider the fact that the average years of experience in the sample is lower for women than for men).

## 
## Call:
## lm(formula = log(wage) ~ male, data = wages)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0445 -0.3073  0.0544  0.3839  3.0325 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.47475    0.01559   94.59   <2e-16 ***
## male1        0.21826    0.02154   10.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6176 on 3294 degrees of freedom
## Multiple R-squared:  0.03023,    Adjusted R-squared:  0.02994 
## F-statistic: 102.7 on 1 and 3294 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = log(wage) ~ exper * male, data = wages)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0906 -0.3050  0.0560  0.3792  3.0468 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.193551   0.057470  20.768  < 2e-16 ***
## exper        0.036367   0.007156   5.082 3.94e-07 ***
## male1        0.463707   0.079062   5.865 4.93e-09 ***
## exper:male1 -0.032071   0.009518  -3.369 0.000762 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6153 on 3292 degrees of freedom
## Multiple R-squared:  0.03792,    Adjusted R-squared:  0.03704 
## F-statistic: 43.25 on 3 and 3292 DF,  p-value: < 2.2e-16
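The interaction in mod.1 can be summarized by writing the fitted log-wage gender gap as a function of experience, gap(exper) = 0.4637 − 0.0321 · exper, which narrows as experience grows and, taken at face value, vanishes around 14.5 years. A sketch of the computation from the coefficients reported above:

```r
b_male <- 0.463707   # coefficient of male1 in mod.1
b_int  <- -0.032071  # coefficient of exper:male1
gap <- function(exper) b_male + b_int * exper  # fitted log-wage gap
gap(c(0, 5, 10, 15))  # the gap shrinks with experience
-b_male / b_int       # experience at which the fitted gap is zero (about 14.5)
```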

Finally, a complete regression model is fitted, mod.2 <- lm(log(wage) ~ ., data = wages), and the following output is obtained by the R function summary.

## 
## Call:
## lm(formula = log(wage) ~ ., data = wages)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0008 -0.2821  0.0468  0.3673  3.2337 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.279709   0.090321  -3.097  0.00197 ** 
## exper        0.034618   0.004549   7.610 3.55e-14 ***
## male1        0.246474   0.020607  11.961  < 2e-16 ***
## school       0.122909   0.006278  19.578  < 2e-16 ***
## regionNorth  0.051107   0.024505   2.086  0.03709 *  
## regionSouth  0.047168   0.024969   1.889  0.05898 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5831 on 3290 degrees of freedom
## Multiple R-squared:  0.1364, Adjusted R-squared:  0.1351 
## F-statistic: 103.9 on 5 and 3290 DF,  p-value: < 2.2e-16

Describe how to interpret these results, and then suggest how to proceed with further analyses.

Assignment of February 21, 2019

Let us consider the data frame loan, which contains information about 42,535 loans ranging from 1,000 $ to 35,000 $, issued by a company called Lending Club. The following variables are considered: good (the behaviour of the client, with values good and bad), fico (the FICO credit score measuring the client's creditworthiness), purpose (the intended use of the loan, with 8 different categories), loan_amnt (the credit amount in $) and income (the annual income in $ of the client). The variables are listed below and the output of the str command is given

## 'data.frame':    42535 obs. of  5 variables:
##  $ good     : Factor w/ 2 levels "bad","good": 2 1 2 2 2 2 2 2 1 1 ...
##  $ purpose  : Factor w/ 8 levels "debt_consolidation",..: 1 4 7 6 6 8 1 4 7 6 ...
##  $ fico     : int  737 742 737 692 697 732 692 662 677 727 ...
##  $ loan_amnt: int  5000 2500 2400 10000 3000 5000 7000 3000 5600 5375 ...
##  $ income   : num  24000 30000 NA 49200 80000 36000 NA 48000 40000 15000 ...

Moreover, the output of the command summary is also given

##    good                      purpose           fico         loan_amnt    
##  bad : 6371   debt_consolidation:25253   Min.   :612.0   Min.   :  500  
##  good:36164   other             : 5160   1st Qu.:687.0   1st Qu.: 5200  
##               major_purchase    : 3926   Median :712.0   Median : 9700  
##               home_improvement  : 3625   Mean   :715.1   Mean   :11090  
##               small_business    : 1992   3rd Qu.:742.0   3rd Qu.:15000  
##               vacation_wedding  : 1404   Max.   :827.0   Max.   :35000  
##               (Other)           : 1175                                  
##      income       
##  Min.   :   4800  
##  1st Qu.:  44995  
##  Median :  63000  
##  Mean   :  75186  
##  3rd Qu.:  90000  
##  Max.   :6000000  
##  NA's   :18758

The aim of the study is to analyze the potential relationship between the response variable good and the explanatory variables considered in the data frame, in order to evaluate the possible good/bad behaviour of a customer. Describe how to perform a preliminary data analysis on this data frame, using suitable R commands. Moreover, consider and discuss the following plots
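With a binary response, the key preliminary displays are the class balance and the conditional distribution of each regressor given good. The loan data frame itself is not distributed with the text, so the sketch below uses a small synthetic stand-in with the same structure (all values invented):

```r
# synthetic stand-in for 'loan' (structure only, invented values)
set.seed(42)
n <- 500
loan <- data.frame(
  good      = factor(sample(c("bad", "good"), n, replace = TRUE,
                            prob = c(0.15, 0.85))),
  purpose   = factor(sample(c("debt_consolidation", "other", "small_business"),
                            n, replace = TRUE)),
  fico      = sample(612:827, n, replace = TRUE),
  loan_amnt = sample(500:35000, n, replace = TRUE),
  income    = ifelse(runif(n) < 0.3, NA, runif(n, 4800, 200000))
)

table(loan$good)                                        # class balance
prop.table(table(loan$purpose, loan$good), margin = 1)  # good rate by purpose
boxplot(fico ~ good, data = loan)
boxplot(log(income) ~ good, data = loan)  # NAs dropped; log scale tames the skew
```

On the real data the many missing income values (18,758 NA's) and the extreme maximum (6,000,000 $) would deserve explicit handling.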

In order to describe the effect of the regressor fico on the response good we consider a simple logistic regression model fitted using the command mod.1 <- glm(good ~ fico, data = loan, family = "binomial"). Comment the model fitting outcomes given by the function summary and the output given by the subsequent commands.

mod.1 <- glm(good ~ fico, data = loan, family = "binomial")
summary(mod.1)
## 
## Call:
## glm(formula = good ~ fico, family = "binomial", data = loan)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5172   0.4078   0.5306   0.6294   0.9622  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.033068   0.296922  -23.69   <2e-16 ***
## fico         0.012358   0.000421   29.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35928  on 42534  degrees of freedom
## Residual deviance: 34985  on 42533  degrees of freedom
## AIC: 34989
## 
## Number of Fisher Scoring iterations: 5
exp(coef(mod.1))
##  (Intercept)         fico 
## 0.0008822209 1.0124345145
test <- data.frame(fico=c(700,750))
test$pred <- predict(mod.1,test, type="response")
test
##   fico      pred
## 1  700 0.8344391
## 2  750 0.9033761
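The two predicted probabilities can be reproduced by hand from the coefficients, since the fitted model is P(good = 1) = plogis(β0 + β1 · fico); a sketch:

```r
b0 <- -7.033068; b1 <- 0.012358  # coefficients of mod.1 from the summary
plogis(b0 + b1 * 700)  # about 0.834, matching predict() for fico = 700
plogis(b0 + b1 * 750)  # about 0.903, matching predict() for fico = 750
```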

Two further logistic regression models are fitted using mod.2 <- glm(good ~ fico + loan_amnt, data = loan, family = "binomial") and mod.3 <- glm(good ~ fico + loan_amnt + income + purpose, data = loan, family = "binomial"). Comment the corresponding output obtained by the R function summary and then suggest how to proceed with a further predictive analysis.

## 
## Call:
## glm(formula = good ~ fico + loan_amnt, family = "binomial", data = loan)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6261   0.4011   0.5256   0.6261   0.9423  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.367e+00  3.007e-01  -24.50   <2e-16 ***
## fico         1.319e-02  4.306e-04   30.62   <2e-16 ***
## loan_amnt   -2.229e-05  1.815e-06  -12.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35928  on 42534  degrees of freedom
## Residual deviance: 34838  on 42532  degrees of freedom
## AIC: 34844
## 
## Number of Fisher Scoring iterations: 5
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Call:
## glm(formula = good ~ fico + loan_amnt + income + purpose, family = "binomial", 
##     data = loan)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3659   0.3840   0.5224   0.6320   1.2589  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -7.482e+00  4.119e-01 -18.165  < 2e-16 ***
## fico                     1.303e-02  5.906e-04  22.061  < 2e-16 ***
## loan_amnt               -3.663e-05  2.580e-06 -14.200  < 2e-16 ***
## income                   7.203e-06  5.426e-07  13.275  < 2e-16 ***
## purposeeducational      -5.076e-01  2.309e-01  -2.198   0.0279 *  
## purposehome_improvement -1.077e-01  7.108e-02  -1.515   0.1298    
## purposemajor_purchase    1.388e-02  7.689e-02   0.180   0.8568    
## purposemedical          -2.678e-01  1.426e-01  -1.878   0.0604 .  
## purposeother            -2.988e-01  6.034e-02  -4.952 7.34e-07 ***
## purposesmall_business   -9.158e-01  7.016e-02 -13.052  < 2e-16 ***
## purposevacation_wedding  1.114e-01  1.126e-01   0.989   0.3226    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20592  on 23776  degrees of freedom
## Residual deviance: 19677  on 23766  degrees of freedom
##   (18758 observations deleted due to missingness)
## AIC: 19699
## 
## Number of Fisher Scoring iterations: 5
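One concrete way to continue with a predictive analysis is a training/test split followed by a confusion matrix. Since the loan data frame is not distributed here, the sketch below simulates fico and good with coefficients loosely inspired by the fitted models; the numbers are purely illustrative:

```r
set.seed(1)
n <- 1000
fico <- round(runif(n, 600, 850))                 # simulated credit scores
good <- rbinom(n, 1, plogis(-9 + 0.0124 * fico))  # simulated good/bad outcome
d <- data.frame(good = good, fico = fico)

train <- d[1:800, ]                               # 80% of rows for fitting
test  <- d[801:1000, ]                            # 20% held out
fit  <- glm(good ~ fico, data = train, family = "binomial")
pred <- as.integer(predict(fit, test, type = "response") > 0.5)
table(observed = test$good, predicted = pred)     # confusion matrix, threshold 0.5
mean(pred == test$good)                           # raw predictive accuracy
```

An ROC curve or cross-validation on the same split would refine this assessment further.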

Exam of 28 January 2020

Let us consider the data frame birthwt, which contains data on 189 births at the Baystate Medical Center, Springfield, Massachusetts, during 1986. The focus is on the variables listed below

birthwt R Documentation

Risk Factors Associated with Low Infant Birth Weight

Format

This data frame contains the following columns:

low

indicator of birth weight less than 2.5 kg.

age

mother’s age in years.

lwt

mother’s weight in pounds at last menstrual period.

race

mother’s race (1 = white, 2 = black, 3 = other).

smoke

smoking status during pregnancy.

ptl

number of previous premature labours.

ht

history of hypertension.

ui

presence of uterine irritability.

ftv

number of physician visits during the first trimester.

bwt

birth weight in grams.

The aim of the study is to analyze the potential relationship between the response variable bwt and the explanatory variables age and race. Describe how to perform a preliminary data analysis on this data frame using suitable R commands and comment on the following plots
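A possible preliminary analysis (a sketch, using the copy of birthwt shipped with the MASS package):

```r
library(MASS)                       # birthwt is distributed with MASS
str(birthwt)                        # 189 observations on 10 variables
summary(birthwt[, c("bwt", "age", "race")])
table(birthwt$race)                 # group sizes: 1 = white, 2 = black, 3 = other
boxplot(bwt ~ race, data = birthwt,
        xlab = "race", ylab = "birth weight (g)")
```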

In order to describe the potential relationship between birth weight and age, also taking into account the factor race, we compare the following nested models

bwt.lm1 <- lm(bwt ~ 1 , data = birthwt)
bwt.lm2 <- lm(bwt ~ age, data = birthwt)
bwt.lm3 <- lm(bwt ~ race + age, data = birthwt)
bwt.lm4 <- lm(bwt ~ race*age, data = birthwt)

Describe the four models and comment on the results given by the Analysis of Variance table reported below. Moreover, propose some alternative model selection procedures.

anova(bwt.lm1, bwt.lm2, bwt.lm3, bwt.lm4)
  Res.Df      RSS Df Sum of Sq        F    Pr(>F)
1    188 99969656
2    187 99154173  1  815483.2 1.590830 0.2087958
3    186 95848562  1 3305610.3 6.448524 0.0119274
4    185 94833785  1 1014776.9 1.979608 0.1611095
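Two points can be sketched here: how the F statistics in the table are obtained, and an alternative selection criterion based on AIC. The models are refitted from the MASS copy of birthwt:

```r
library(MASS)                                    # provides the birthwt data
bwt.lm1 <- lm(bwt ~ 1,          data = birthwt)
bwt.lm2 <- lm(bwt ~ age,        data = birthwt)
bwt.lm3 <- lm(bwt ~ race + age, data = birthwt)
bwt.lm4 <- lm(bwt ~ race * age, data = birthwt)

# F statistic for adding race (Model 2 -> Model 3), computed by hand:
# each Sum of Sq is divided by its Df and by the residual mean square
# of the largest model in the comparison (Model 4).
ms4 <- deviance(bwt.lm4) / df.residual(bwt.lm4)
(deviance(bwt.lm2) - deviance(bwt.lm3)) / ms4    # about 6.45, as in the table

# An alternative to sequential F tests: compare the fits by AIC
AIC(bwt.lm1, bwt.lm2, bwt.lm3, bwt.lm4)          # smaller is better
```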

Let us consider Model 3 and comment on the output obtained with the R functions summary and plot.

## 
## Call:
## lm(formula = bwt ~ race + age, data = birthwt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2104.8  -485.7    11.3   536.0  1746.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3028.482    277.679  10.906   <2e-16 ***
## race        -146.597     57.881  -2.533   0.0121 *  
## age            8.039     10.032   0.801   0.4240    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 717.9 on 186 degrees of freedom
## Multiple R-squared:  0.04122,    Adjusted R-squared:  0.03091 
## F-statistic: 3.999 on 2 and 186 DF,  p-value: 0.01994

Finally, discuss the following graphical output and then suggest how to proceed with further analyses.
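The graphical output in question is the standard set of lm diagnostic plots; a sketch refitting Model 3 from the MASS copy of the data, together with one possible further step (coding race as a factor rather than a numeric score):

```r
library(MASS)
bwt.lm3 <- lm(bwt ~ race + age, data = birthwt)
par(mfrow = c(2, 2))
plot(bwt.lm3)    # residuals vs fitted, normal Q-Q, scale-location, leverage

# Further analysis: race has labels 1/2/3, so treating it as a factor
# (one coefficient per group) may be more appropriate than a linear score.
bwt.lm3f <- lm(bwt ~ factor(race) + age, data = birthwt)
AIC(bwt.lm3, bwt.lm3f)
```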

Exam of 18 February 2020

Let us consider the data frame wine, which contains information about 178 samples of wines grown in the same region of Italy. The cultivar of each wine sample is recorded (variable cultivar, with labels 1, 2, 3), together with the concentrations of 13 different chemicals (variables V1-V13). Describe how to perform a preliminary data analysis on this data frame using suitable R commands and comment on the following outputs.

wine <- read.table("./data/wine.txt", header = TRUE)
summary(wine)
          Min.    1st Qu.    Median      Mean   3rd Qu.      Max.
V1       1.000      1.000     2.000     1.938     3.000     3.000
V2      11.03      12.36     13.05     13.00     13.68     14.83
V3       0.740      1.603     1.865     2.336     3.083     5.800
V4       1.360      2.210     2.360     2.367     2.558     3.230
V5      10.60      17.20     19.50     19.49     21.50     30.00
V6      70.00      88.00     98.00     99.74    107.00    162.00
V7       0.980      1.742     2.355     2.295     2.800     3.880
V8       0.340      1.205     2.135     2.029     2.875     5.080
V9       0.1300     0.2700    0.3400    0.3619    0.4375    0.6600
V10      0.410      1.250     1.555     1.591     1.950     3.580
V11      1.280      3.220     4.690     5.058     6.200    13.000
V12      0.4800     0.7825    0.9650    0.9574    1.1200    1.7100
V13      1.270      1.938     2.780     2.612     3.170     4.000
V14    278.0      500.5      673.5     746.9     985.0    1680.0
sapply(wine[2:14],sd)
##          V2          V3          V4          V5          V6          V7 
##   0.8118265   1.1171461   0.2743440   3.3395638  14.2824835   0.6258510 
##          V8          V9         V10         V11         V12         V13 
##   0.9988587   0.1244533   0.5723589   2.3182859   0.2285716   0.7099904 
##         V14 
## 314.9074743

Moreover, discuss the scatterplot matrix below, which shows the first 5 numerical variables, with colours indicating cultivar 1 (black), cultivar 2 (red) and cultivar 3 (blue).
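A scatterplot matrix of this kind is usually produced with pairs(); a sketch using the built-in iris data as a stand-in, since wine.txt is not distributed here:

```r
# Colour each point by its group label, mimicking the cultivar colours
grp <- as.integer(iris$Species)      # stand-in for the three cultivar labels
pairs(iris[, 1:4], col = grp, pch = 19,
      main = "Scatterplot matrix coloured by group")
```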

The aim of the study is to adequately synthesize the information given by the original variables V1-V13, in order to capture as much of it as possible. A further objective is to use some of these new derived variables to distinguish the three different cultivars. The Principal Components Analysis procedure is applied. Present the main features of this statistical procedure, describe the arguments specified below in the function princomp and discuss the output of the function loadings.

wine.pca <- princomp(wine[2:14], cor=TRUE)
loadings(wine.pca)[,1:4]
        Comp.1     Comp.2     Comp.3     Comp.4
V2   0.1443294  0.4836515  0.2073826  0.0178563
V3  -0.2451876  0.2249309 -0.0890129 -0.5368903
V4  -0.0020511  0.3160688 -0.6262239  0.2141756
V5  -0.2393204 -0.0105905 -0.6120803 -0.0608594
V6   0.1419920  0.2996340 -0.1307569  0.3517966
V7   0.3946608  0.0650395 -0.1461790 -0.1980683
V8   0.4229343 -0.0033598 -0.1506819 -0.1522948
V9  -0.2985331  0.0287795 -0.1703682  0.2033010
V10  0.3134295  0.0393017 -0.1494543 -0.3990565
V11 -0.0886167  0.5299957  0.1373062 -0.0659257
V12  0.2967146 -0.2792351 -0.0852219  0.4277714
V13  0.3761674 -0.1644962 -0.1660046 -0.1841207
V14  0.2867522  0.3649028  0.1267459  0.2320709

Moreover, discuss the following graphical outputs

Finally, comment on this last plot, with particular regard to the aim of characterizing the three different cultivars.
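A plot of this kind (scores on the first two principal components, coloured by group) can be sketched with a stand-in dataset, again using iris in place of the wine data:

```r
pca <- princomp(iris[, 1:4], cor = TRUE)   # PCA on the correlation matrix
summary(pca)                               # proportion of variance per component
plot(pca$scores[, 1], pca$scores[, 2],
     col = as.integer(iris$Species), pch = 19,
     xlab = "Comp.1", ylab = "Comp.2")     # well-separated groups appear as clusters
```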